18 research outputs found

    Maghrebi Arabic dialect processing: an overview

    Natural language processing (NLP) for Arabic dialects (AD) has grown considerably in recent years, with work addressing all aspects of NLP. However, some AD varieties have received more attention and have a growing collection of resources, while others, such as Maghrebi, still lag behind. Maghrebi Arabic is the family of Arabic dialects spoken in the Maghreb region (principally Algeria, Tunisia and Morocco). This work focuses on the dialects of these three countries. The paper presents a review of natural language processing for Maghrebi Arabic dialects.

    Script Independent Morphological Segmentation for Arabic Maghrebi Dialects: An Application to Machine Translation

    This research deals with resource creation for under-resourced languages. We adapt existing resources built for well-resourced languages to process less-resourced ones. We focus on the Arabic dialects of the Maghreb, namely Algerian, Moroccan and Tunisian. We first adapt a well-known statistical word segmenter to segment Algerian dialect texts written in both Arabic and Latin scripts, and demonstrate that unsupervised morphological segmentation can be applied to Arabic dialects regardless of the script used. Next, we use this segmentation to improve statistical machine translation scores between the three Maghrebi dialects and French, using a parallel multidialectal corpus that includes six Arabic dialects in addition to Modern Standard Arabic (MSA) and French. For word segmentation, the rate of correctly segmented words reached 70% for words written in Latin script and 79% for words written in Arabic script. For machine translation, unsupervised morphological segmentation reduced out-of-vocabulary word rates by at least 35%.
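The out-of-vocabulary (OOV) rate mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say how the rate was computed, so the token-level definition below is an assumption, and the function name `oov_rate` and the toy romanized strings are invented purely for illustration.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens that never occur in the training data.

    Assumption: the rate is measured over running tokens, not over types.
    """
    if not test_tokens:
        return 0.0
    vocabulary = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocabulary)
    return unseen / len(test_tokens)


# Invented toy data: a prefix "w" glued to a stem, before and after a
# hypothetical segmentation step splits the prefix off.
train_unsegmented = ["wktab", "dar"]
test_unsegmented = ["wdar"]

train_segmented = ["w", "ktab", "dar"]
test_segmented = ["w", "dar"]

rate_before = oov_rate(train_unsegmented, test_unsegmented)
rate_after = oov_rate(train_segmented, test_segmented)
```

In this toy case the unsegmented test word is unseen as a whole, while both of its segments appear in the segmented training data, which is the effect segmentation is expected to have on OOV rates.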

    Creating Parallel Arabic Dialect Corpus: Pitfalls to Avoid

    Creating parallel corpora is a difficult problem that many researchers try to address. For under-resourced languages such as Arabic dialects, the problem is further complicated by the nature of these spoken languages. In this paper, we share our experience of creating a parallel corpus that contains several dialects and Modern Standard Arabic (MSA). We highlight the most important choices we made and assess how well they worked.

    Grapheme To Phoneme Conversion - An Arabic Dialect Case

    We aim to develop a speech translation system between Modern Standard Arabic and the Algiers dialect. Such a system must include a text-to-speech module, which in turn must include a grapheme-to-phoneme converter. The Algiers dialect is an Arabic dialect that shares most of the problems Modern Standard Arabic poses for NLP. Furthermore, it can be considered an under-resourced language because it is a vernacular for which no substantial corpus exists. In this paper we present a grapheme-to-phoneme converter for this dialect. We used a rule-based approach and a statistical approach and obtained accuracies of 92% versus 85%, despite the lack of resources for this language.

    Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus

    In this paper we present PADIC, a Parallel Arabic DIalect Corpus that we built from scratch, and report experiments on cross-dialect Arabic machine translation. PADIC is composed of dialects from both the Maghreb and the Middle East, each aligned with Modern Standard Arabic (MSA). The study covers three dialects from the Maghreb (two from Algeria and one from Tunisia) and two from the Middle East (Syria and Palestine). PADIC was built from scratch because of the lack of dialect resources: Arabic dialects are used in daily conversation throughout the Arab world, but they are generally not written. To the best of our knowledge, PADIC is so far the largest corpus available to the community working on dialects, especially those of the Maghreb. It contains 6400 sentences for each of the five dialects and for MSA. We conducted cross-lingual machine translation experiments between all the language pairs. For translation into MSA, we interpolated the corresponding language model (LM) with an LM trained on a large Arabic corpus. We also studied the impact of language model smoothing techniques on machine translation results because this corpus, even though it is the largest of its kind, is still very small compared with those used for translating natural languages.

    Cross-Dialectal Arabic Processing

    In this paper we present an Arabic multi-dialect study covering dialects from both the Maghreb and the Middle East, which we compare with Modern Standard Arabic (MSA). The study covers three dialects from the Maghreb (two from Algeria and one from Tunisia) and two from the Middle East (Syria and Palestine). The resources, built from scratch, have led to a multi-dialect parallel collection that has been aligned by hand with an MSA corpus. We conducted several analytical studies to understand the relationship between these vernacular languages. To this end, we measured the closeness between all pairs of dialects and MSA in terms of Hellinger distance. We also performed a dialect identification experiment, which showed that neighbouring dialects tend, as expected, to be confused, making them difficult to identify. Because Arabic dialects differ from one region to another, which makes communication between people difficult, we conducted cross-lingual machine translation between all pairs of dialects and also with MSA. Several interesting conclusions were drawn from this experiment.
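For readers unfamiliar with the Hellinger distance mentioned in this abstract, the following is a minimal sketch of the standard definition for two discrete distributions, H(P, Q) = (1/√2) · sqrt(Σ (√p_i − √q_i)²). The abstract does not say which distributions were compared; using unigram word frequencies of each corpus is an assumption made here, and the function names are hypothetical.

```python
import math
from collections import Counter


def unigram_distribution(tokens):
    """Relative frequency of each word (assumed representation of a corpus)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions given as dicts.

    Words missing from one distribution are treated as having probability 0.
    The result lies in [0, 1]: 0 for identical distributions, 1 when the
    two distributions share no words at all.
    """
    vocabulary = set(p) | set(q)
    squared_sum = sum(
        (math.sqrt(p.get(word, 0.0)) - math.sqrt(q.get(word, 0.0))) ** 2
        for word in vocabulary
    )
    return math.sqrt(squared_sum) / math.sqrt(2)
```

Under this assumption, each dialect corpus would be turned into a distribution with `unigram_distribution`, and `hellinger_distance` would be computed for every pair of dialects and for each dialect against MSA.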

    Traduction Automatique Fondée sur des Méthodes Statistiques : Application aux Langues peu Dotées en Ressources

    This work is dedicated to statistical machine translation for poorly resourced languages. We are interested in Arabic dialects, which are the everyday language of all Arab peoples. These dialects differ from one Arab country to another, and even within a single country several dialect variants coexist. Because of their oral and non-standard nature, these dialects are a challenge for NLP. In statistical machine translation, they are difficult to handle because of the lack of resources of all kinds, in particular the monolingual and especially the parallel corpora needed to train the various statistical models. This thesis addresses this issue with particular attention to the Algerian dialect, and more precisely to the Algiers dialect. A parallel multi-dialect corpus, PADIC (Parallel Arabic Dialect Corpus), has been created. It is an important textual resource that so far includes six Arabic dialects in addition to Modern Standard Arabic. The corpus was the subject of an analytical study to highlight the relationships among the dialects and between the dialects and Standard Arabic. Using PADIC, we tackled the problem of statistical machine translation between the different pairs of dialects and Standard Arabic. Several results have been obtained, and all point to the difficulty of translating dialects. In addition, several tools dedicated to the Algiers dialect were produced in the course of this thesis. The problem of code-switching was also addressed, and an identification tool was implemented using machine learning techniques.

    Machine translation for Arabic dialects (survey)

    Arabic dialects, also called colloquial Arabic or vernaculars, are spoken varieties of Standard Arabic. These dialects have mixed forms with many variations due to the influence of ancient local tongues and of other languages, such as European ones. Many of these dialects are mutually unintelligible. Until recently, Arabic dialects were not written and were used only in spoken form. Nowadays, with the advent of the internet and mobile telephony, these dialects are increasingly used in written form, as this kind of communication has brought everyday conversations into a written format. This allows Arab people to use their dialects, which are their actual native languages, to express their opinions on social media, to chat, to send text messages, and so on. This growing use opens new research directions for Arabic natural language processing (NLP). In this paper, we focus on machine translation in the context of Arabic dialects and provide a survey of recent research in this area. For each study, we give a detailed description of the adopted approach and its most relevant contribution.
